
perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1160

Open
JC-ut0 wants to merge 2 commits into jd-opensource:main from JC-ut0:gdn_cache_fix

Conversation

@JC-ut0
Contributor

@JC-ut0 JC-ut0 commented Apr 1, 2026

Add logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactor WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer.
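The allocation decision described above can be sketched as follows. This is a minimal illustration of the idea, not the actual xllm code: the enum and the `plan_cache_tensors` helper are hypothetical names, and the real WorkerImpl allocates torch tensors rather than returning name lists.

```cpp
#include <string>
#include <vector>

// Illustrative layer classification for a hybrid model: some layers use
// full (softmax) attention, the rest use linear (GDN) attention.
enum class LayerKind { FullAttention, LinearAttention };

// Hypothetical sketch of the per-layer selection: full-attention layers
// need key/value caches, while linear-attention layers need conv/ssm
// state tensors instead, so each layer only gets the tensors it uses.
std::vector<std::string> plan_cache_tensors(LayerKind kind) {
  if (kind == LayerKind::FullAttention) {
    return {"key_cache", "value_cache"};
  }
  return {"conv_state", "ssm_state"};
}
```

The point of allocating selectively is that linear-attention layers carry constant-size recurrent state, so reserving full KV-cache tensors for them would waste device memory.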

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for hybrid attention models (such as qwen3_next) by differentiating between full attention and linear (GDN) attention layers during KV cache estimation and allocation. Key changes include updating LLMEngine and RecEngine to calculate cache capacity based on specific layer types, adding logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactoring WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer. Review feedback highlights the need for better consistency across the engine: call sites should use the centralized is_full_attention_layer helper to avoid logic errors around the default attention interval and potential division-by-zero issues.
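The centralized helper the review refers to might look like the sketch below. The signature, the default interval, and the layer pattern are all assumptions for illustration; the substance is that one shared definition classifies layers and guards the interval, so no call site can hit a modulo/division by zero with an unset value.

```cpp
// Hypothetical sketch of an is_full_attention_layer-style helper.
// kDefaultFullAttentionInterval is an illustrative default, not the
// model's actual configuration value.
constexpr int kDefaultFullAttentionInterval = 4;

bool is_full_attention_layer(int layer_idx, int full_attention_interval) {
  if (full_attention_interval <= 0) {
    // Unset or invalid interval: fall back to treating every layer as
    // full attention instead of computing `% 0`.
    return true;
  }
  // Example pattern: with interval 4, layers 3, 7, 11, ... use full
  // attention and the remaining layers use linear (GDN) attention.
  return (layer_idx + 1) % full_attention_interval == 0;
}
```

Routing every layer-type check through one helper keeps cache estimation (engine side) and cache allocation (worker side) from drifting apart.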

torch::dtype(dtype_).device(device_)),
2);
}
#elif defined(USE_ILU) || defined(USE_MLU) || defined(USE_MUSA)
Collaborator


The #elif defined(USE_ILU) || defined(USE_MLU) || defined(USE_MUSA) and #else branches appear to have identical behavior — what's the reason for splitting them?
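If the two branches really are identical, the merge the reviewer is hinting at is just dropping the `#elif` and letting those platforms fall through to `#else`. A toy sketch of the shape (the first branch's macro and both bodies are placeholders, not the real allocation code):

```cpp
// Before: #if <platform A> ... #elif defined(USE_ILU) || defined(USE_MLU)
//         || defined(USE_MUSA) ... #else ... #endif, with the last two
// branches identical. After: the duplicate branch is removed.
#if defined(USE_NPU)  // placeholder condition for the platform-specific path
int cache_tensor_rank() { return 3; }  // stand-in for the special layout
#else
// USE_ILU, USE_MLU, USE_MUSA, and the generic path share this behavior.
int cache_tensor_rank() { return 2; }  // stand-in for the shared layout
#endif
```

Keeping an explicitly named branch can still be a deliberate choice, as the author notes below, when the platforms are expected to diverge later.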

Contributor Author


This follows the original code style below.

Removed unused layer types variable from worker_impl.cpp
